🚀 Exploratory Data Analysis (EDA)

🔧 Subtask 1: Data Overview and Summary Statistics

📋 Implementation Plan

Load the dataset and generate summary statistics (count, mean, standard deviation, minimum, maximum, and quartiles) for each numerical feature to establish an initial picture of central tendency and spread. Also review each column's data type to confirm it matches the column's content.
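The loading step itself is not shown in the developer code, which assumes `df` already exists. A minimal sketch of it follows; the in-memory CSV is a tiny stand-in, and in practice the real file (e.g. a path like `"Pumpkin_Seeds_Dataset.csv"`, which is an assumption here) would be read from disk.

```python
import io
import pandas as pd

# Tiny in-memory stand-in; in practice something like:
#   df = pd.read_csv("Pumpkin_Seeds_Dataset.csv")   # filename is an assumption
csv_data = io.StringIO(
    "Area,Perimeter,Class\n"
    "47939,868.485,Çerçevelik\n"
    "136574,1559.45,Ürgüp Sivrisi\n"
)
df = pd.read_csv(csv_data)

# Quick sanity checks before running describe()/dtypes
print(df.shape)
print(df.columns.tolist())
```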

👨‍💻 Developer Code

import pandas as pd

# Assumes df has already been loaded from the dataset file

# Display summary statistics for numerical columns
print("Summary Statistics for Numerical Features:")
print(df.describe())

# Display data types of all columns
print("\nData Types of Each Column:")
print(df.dtypes)

🖥 Execution Result

Summary Statistics for Numerical Features:
                Area    Perimeter  Major_Axis_Length  Minor_Axis_Length  \
count    2500.000000  2500.000000        2500.000000        2500.000000   
mean    80658.220800  1130.279015         456.601840         225.794921   
std     13664.510228   109.256418          56.235704          23.297245   
min     47939.000000   868.485000         320.844600         152.171800   
25%     70765.000000  1048.829750         414.957850         211.245925   
50%     79076.000000  1123.672000         449.496600         224.703100   
75%     89757.500000  1203.340500         492.737650         240.672875   
max    136574.000000  1559.450000         661.911300         305.818000   

         Convex_Area  Equiv_Diameter  Eccentricity     Solidity       Extent  \
count    2500.000000     2500.000000   2500.000000  2500.000000  2500.000000   
mean    81508.084400      319.334230      0.860879     0.989492     0.693205   
std     13764.092788       26.891920      0.045167     0.003494     0.060914   
min     48366.000000      247.058400      0.492100     0.918600     0.468000   
25%     71512.000000      300.167975      0.831700     0.988300     0.658900   
50%     79872.000000      317.305350      0.863700     0.990300     0.713050   
75%     90797.750000      338.057375      0.897025     0.991500     0.740225   
max    138384.000000      417.002900      0.948100     0.994400     0.829600   

         Roundness  Aspect_Ration  Compactness  
count  2500.000000    2500.000000  2500.000000  
mean      0.791533       2.041702     0.704121  
std       0.055924       0.315997     0.053067  
min       0.554600       1.148700     0.560800  
25%       0.751900       1.801050     0.663475  
50%       0.797750       1.984200     0.707700  
75%       0.834325       2.262075     0.743500  
max       0.939600       3.144400     0.904900  

Data Types of Each Column:
Area                   int64
Perimeter            float64
Major_Axis_Length    float64
Minor_Axis_Length    float64
Convex_Area            int64
Equiv_Diameter       float64
Eccentricity         float64
Solidity             float64
Extent               float64
Roundness            float64
Aspect_Ration        float64
Compactness          float64
Class                 object
dtype: object

🔧 Subtask 2: Missing Values and Data Types Analysis

📋 Implementation Plan

Examine the dataset for any missing or null values across all columns and assess the data types of each feature to ensure they are correctly assigned (e.g., numerical features as floats or integers, categorical features as objects). This step helps identify data cleaning needs and confirms data integrity.

👨‍💻 Developer Code

import pandas as pd

# Check for missing/null values in each column
missing_values = df.isnull().sum()
print("Missing or Null Values per Column:")
print(missing_values)

print("\nData Types of Each Column:")
print(df.dtypes)

🖥 Execution Result

Missing or Null Values per Column:
Area                 0
Perimeter            0
Major_Axis_Length    0
Minor_Axis_Length    0
Convex_Area          0
Equiv_Diameter       0
Eccentricity         0
Solidity             0
Extent               0
Roundness            0
Aspect_Ration        0
Compactness          0
Class                0
dtype: int64

Data Types of Each Column:
Area                   int64
Perimeter            float64
Major_Axis_Length    float64
Minor_Axis_Length    float64
Convex_Area            int64
Equiv_Diameter       float64
Eccentricity         float64
Solidity             float64
Extent               float64
Roundness            float64
Aspect_Ration        float64
Compactness          float64
Class                 object
dtype: object

🔧 Subtask 3: Distribution Analysis of Numerical Features

📋 Implementation Plan

Beyond summary statistics, plot a histogram or density estimate for each numerical feature to assess skewness, modality, and range; this informs later decisions about transformations and scaling. Label each plot clearly and interpret what it shows about the distribution.

👨‍💻 Developer Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns only (exclude 'Class' which is object)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

print("Summary Statistics for Numerical Features:")
print(df[numerical_cols].describe())

# Plot distribution for each numerical feature
for col in numerical_cols:
    plt.figure(figsize=(8, 5))
    sns.histplot(df[col], kde=True, bins=30, color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.grid(True, linestyle='--', alpha=0.6)
    # Interpretation printout for each feature
    skewness = df[col].skew()
    print(f"\nFeature: {col}")
    print(f" - Skewness: {skewness:.3f}")
    # |skewness| > 1 is a common rule of thumb for strong skew
    if skewness > 1:
        print(" - Interpretation: Highly right-skewed distribution, consider transformation.")
    elif skewness < -1:
        print(" - Interpretation: Highly left-skewed distribution, consider transformation.")
    else:
        print(" - Interpretation: Approximately symmetric distribution.")

🖥 Execution Result

Summary Statistics for Numerical Features:
                Area    Perimeter  Major_Axis_Length  Minor_Axis_Length  \
count    2500.000000  2500.000000        2500.000000        2500.000000   
mean    80658.220800  1130.279015         456.601840         225.794921   
std     13664.510228   109.256418          56.235704          23.297245   
min     47939.000000   868.485000         320.844600         152.171800   
25%     70765.000000  1048.829750         414.957850         211.245925   
50%     79076.000000  1123.672000         449.496600         224.703100   
75%     89757.500000  1203.340500         492.737650         240.672875   
max    136574.000000  1559.450000         661.911300         305.818000   

         Convex_Area  Equiv_Diameter  Eccentricity     Solidity       Extent  \
count    2500.000000     2500.000000   2500.000000  2500.000000  2500.000000   
mean    81508.084400      319.334230      0.860879     0.989492     0.693205   
std     13764.092788       26.891920      0.045167     0.003494     0.060914   
min     48366.000000      247.058400      0.492100     0.918600     0.468000   
25%     71512.000000      300.167975      0.831700     0.988300     0.658900   
50%     79872.000000      317.305350      0.863700     0.990300     0.713050   
75%     90797.750000      338.057375      0.897025     0.991500     0.740225   
max    138384.000000      417.002900      0.948100     0.994400     0.829600   

         Roundness  Aspect_Ration  Compactness  
count  2500.000000    2500.000000  2500.000000  
mean      0.791533       2.041702     0.704121  
std       0.055924       0.315997     0.053067  
min       0.554600       1.148700     0.560800  
25%       0.751900       1.801050     0.663475  
50%       0.797750       1.984200     0.707700  
75%       0.834325       2.262075     0.743500  
max       0.939600       3.144400     0.904900  

Feature: Area
 - Skewness: 0.496
 - Interpretation: Approximately symmetric distribution.

Feature: Perimeter
 - Skewness: 0.415
 - Interpretation: Approximately symmetric distribution.

Feature: Major_Axis_Length
 - Skewness: 0.503
 - Interpretation: Approximately symmetric distribution.

Feature: Minor_Axis_Length
 - Skewness: 0.104
 - Interpretation: Approximately symmetric distribution.

Feature: Convex_Area
 - Skewness: 0.494
 - Interpretation: Approximately symmetric distribution.

Feature: Equiv_Diameter
 - Skewness: 0.272
 - Interpretation: Approximately symmetric distribution.

Feature: Eccentricity
 - Skewness: -0.749
 - Interpretation: Approximately symmetric distribution.

Feature: Solidity
 - Skewness: -5.691
 - Interpretation: Highly left-skewed distribution, consider transformation.

Feature: Extent
 - Skewness: -1.027
 - Interpretation: Highly left-skewed distribution, consider transformation.

Feature: Roundness
 - Skewness: -0.373
 - Interpretation: Approximately symmetric distribution.

Feature: Aspect_Ration
 - Skewness: 0.548
 - Interpretation: Approximately symmetric distribution.

Feature: Compactness
 - Skewness: -0.062
 - Interpretation: Approximately symmetric distribution.
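For the two features flagged above (Solidity, Extent), a common remedy for strong left skew is to reflect the values about a point just above their maximum and apply a Box-Cox transform with an MLE-chosen lambda. A hedged sketch on synthetic left-skewed data standing in for `df['Solidity']` (the distribution parameters and reflection bound are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic left-skewed sample bounded near 1, mimicking Solidity
solidity = 0.995 - rng.gamma(shape=2.0, scale=0.002, size=2500)

# Reflect about a point just above the maximum, then Box-Cox the result
upper = 0.9951                 # assumed bound, slightly above the observed max
reflected = upper - solidity   # strictly positive, right-skewed
transformed, lmbda = stats.boxcox(reflected)

skew_before = stats.skew(solidity)
skew_after = stats.skew(transformed)
print(f"skew before: {skew_before:.3f}")
print(f"skew after:  {skew_after:.3f}")
```

Note that the reflection flips the feature's ordering; negate the transformed values if the original direction matters downstream.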

📈 Visuals

[12 figures: histogram with KDE overlay for each numerical feature]


🔧 Subtask 4: Correlation Analysis

📋 Implementation Plan

Calculate the correlation matrix for all numerical features and visualize it as a heatmap (e.g., with seaborn or matplotlib). The heatmap should make strong positive and negative correlations easy to spot, highlighting feature interdependencies and potential multicollinearity.

👨‍💻 Developer Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()

print("Correlation Matrix:")
print(corr_matrix)

# Plot heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', center=0,
            cbar_kws={"shrink": .8}, square=True, linewidths=0.5)

plt.title('Correlation Heatmap of Numerical Features')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.grid(False)

🖥 Execution Result

Correlation Matrix:
                       Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  \
Area               1.000000   0.928548           0.789133           0.685304   
Perimeter          0.928548   1.000000           0.946181           0.392913   
Major_Axis_Length  0.789133   0.946181           1.000000           0.099376   
Minor_Axis_Length  0.685304   0.392913           0.099376           1.000000   
Convex_Area        0.999806   0.929971           0.789061           0.685634   
Equiv_Diameter     0.998464   0.928055           0.787078           0.690020   
Eccentricity       0.159624   0.464601           0.704287          -0.590877   
Solidity           0.158388   0.065340           0.119291           0.090915   
Extent            -0.014018  -0.140600          -0.214990           0.233576   
Roundness         -0.149378  -0.500968          -0.684972           0.558566   
Aspect_Ration      0.159960   0.487880           0.729156          -0.598475   
Compactness       -0.160438  -0.484440          -0.726958           0.603441   

                   Convex_Area  Equiv_Diameter  Eccentricity  Solidity  \
Area                  0.999806        0.998464      0.159624  0.158388   
Perimeter             0.929971        0.928055      0.464601  0.065340   
Major_Axis_Length     0.789061        0.787078      0.704287  0.119291   
Minor_Axis_Length     0.685634        0.690020     -0.590877  0.090915   
Convex_Area           1.000000        0.998289      0.159156  0.139178   
Equiv_Diameter        0.998289        1.000000      0.156246  0.159454   
Eccentricity          0.159156        0.156246      1.000000  0.043991   
Solidity              0.139178        0.159454      0.043991  1.000000   
Extent               -0.015449       -0.010970     -0.327316  0.067537   
Roundness            -0.153615       -0.145313     -0.890651  0.200836   
Aspect_Ration         0.159822        0.155762      0.950225  0.026410   
Compactness          -0.160432       -0.156411     -0.981689 -0.019967   

                     Extent  Roundness  Aspect_Ration  Compactness  
Area              -0.014018  -0.149378       0.159960    -0.160438  
Perimeter         -0.140600  -0.500968       0.487880    -0.484440  
Major_Axis_Length -0.214990  -0.684972       0.729156    -0.726958  
Minor_Axis_Length  0.233576   0.558566      -0.598475     0.603441  
Convex_Area       -0.015449  -0.153615       0.159822    -0.160432  
Equiv_Diameter    -0.010970  -0.145313       0.155762    -0.156411  
Eccentricity      -0.327316  -0.890651       0.950225    -0.981689  
Solidity           0.067537   0.200836       0.026410    -0.019967  
Extent             1.000000   0.352338      -0.329933     0.336984  
Roundness          0.352338   1.000000      -0.935233     0.933308  
Aspect_Ration     -0.329933  -0.935233       1.000000    -0.990778  
Compactness        0.336984   0.933308      -0.990778     1.000000
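Given the near-perfect correlations above (Area/Convex_Area/Equiv_Diameter at r > 0.998, Eccentricity/Compactness at r ≈ -0.98), a natural follow-up is to prune one feature from each highly correlated pair. A hedged sketch with synthetic stand-in columns and an assumed 0.95 threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
area = rng.normal(80000, 13000, 500)
demo = pd.DataFrame({
    "Area": area,
    "Convex_Area": area * 1.01 + rng.normal(0, 200, 500),  # near-duplicate
    "Extent": rng.normal(0.69, 0.06, 500),                 # unrelated
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print("Drop candidates:", to_drop)
```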

📈 Visuals

[Figure: correlation heatmap of the numerical features]


🔧 Subtask 5: Class Distribution Analysis

📋 Implementation Plan

Evaluate the distribution of the target variable 'Class' by counting the instances of each class label. This will help determine if the dataset is balanced or if class imbalance techniques might be required.

👨‍💻 Developer Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Count instances of each class label
class_counts = df['Class'].value_counts()
class_proportions = df['Class'].value_counts(normalize=True)

print("Class Counts:")
print(class_counts)
print("\nClass Proportions:")
print(class_proportions)

# Plot class distribution
plt.figure(figsize=(8, 5))
# hue + legend=False keeps the palette without seaborn's deprecation warning
sns.barplot(x=class_counts.index, y=class_counts.values,
            hue=class_counts.index, palette='viridis', legend=False)
plt.title('Distribution of Target Variable: Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.grid(axis='y', linestyle='--', alpha=0.7)

🖥 Execution Result

Class Counts:
Class
Çerçevelik       1300
Ürgüp Sivrisi    1200
Name: count, dtype: int64

Class Proportions:
Class
Çerçevelik       0.52
Ürgüp Sivrisi    0.48
Name: proportion, dtype: float64
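At 52/48 the classes are nearly balanced (imbalance ratio ≈ 1.08), so resampling is likely unnecessary, but a stratified split still guarantees both classes keep their proportions in train and test sets. A hedged sketch with synthetic labels matching the counts above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

labels = pd.Series(["Çerçevelik"] * 1300 + ["Ürgüp Sivrisi"] * 1200)

counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.3f}")

# Stratified split preserves the 52/48 proportions in both subsets
train, test = train_test_split(labels, test_size=0.2,
                               stratify=labels, random_state=42)
print(test.value_counts(normalize=True))
```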

📈 Visuals

[Figure: bar chart of class counts]


🔧 Subtask 6: Outlier Detection

📋 Implementation Plan

Identify potential outliers in numerical features using statistical methods such as the interquartile range (IQR) or visualization techniques like boxplots. This step is crucial for understanding data variability and deciding on outlier treatment strategies.

👨‍💻 Developer Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

print("Outlier Detection using IQR method:")
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    num_outliers = outliers.shape[0]

    print(f"\nFeature: {col}")
    print(f" - Q1: {Q1:.3f}, Q3: {Q3:.3f}, IQR: {IQR:.3f}")
    print(f" - Lower Bound: {lower_bound:.3f}, Upper Bound: {upper_bound:.3f}")
    print(f" - Number of potential outliers: {num_outliers}")

    # Boxplot to visualize outliers
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=df[col], color='lightcoral')
    plt.title(f'Boxplot of {col} (Outlier Detection)')
    plt.xlabel(col)
    plt.grid(True, linestyle='--', alpha=0.6)

🖥 Execution Result

Outlier Detection using IQR method:

Feature: Area
 - Q1: 70765.000, Q3: 89757.500, IQR: 18992.500
 - Lower Bound: 42276.250, Upper Bound: 118246.250
 - Number of potential outliers: 18

Feature: Perimeter
 - Q1: 1048.830, Q3: 1203.341, IQR: 154.511
 - Lower Bound: 817.064, Upper Bound: 1435.107
 - Number of potential outliers: 16

Feature: Major_Axis_Length
 - Q1: 414.958, Q3: 492.738, IQR: 77.780
 - Lower Bound: 298.288, Upper Bound: 609.407
 - Number of potential outliers: 21

Feature: Minor_Axis_Length
 - Q1: 211.246, Q3: 240.673, IQR: 29.427
 - Lower Bound: 167.106, Upper Bound: 284.813
 - Number of potential outliers: 30

Feature: Convex_Area
 - Q1: 71512.000, Q3: 90797.750, IQR: 19285.750
 - Lower Bound: 42583.375, Upper Bound: 119726.375
 - Number of potential outliers: 17

Feature: Equiv_Diameter
 - Q1: 300.168, Q3: 338.057, IQR: 37.889
 - Lower Bound: 243.334, Upper Bound: 394.891
 - Number of potential outliers: 13

Feature: Eccentricity
 - Q1: 0.832, Q3: 0.897, IQR: 0.065
 - Lower Bound: 0.734, Upper Bound: 0.995
 - Number of potential outliers: 18

Feature: Solidity
 - Q1: 0.988, Q3: 0.992, IQR: 0.003
 - Lower Bound: 0.983, Upper Bound: 0.996
 - Number of potential outliers: 103

Feature: Extent
 - Q1: 0.659, Q3: 0.740, IQR: 0.081
 - Lower Bound: 0.537, Upper Bound: 0.862
 - Number of potential outliers: 46

Feature: Roundness
 - Q1: 0.752, Q3: 0.834, IQR: 0.082
 - Lower Bound: 0.628, Upper Bound: 0.958
 - Number of potential outliers: 5

Feature: Aspect_Ration
 - Q1: 1.801, Q3: 2.262, IQR: 0.461
 - Lower Bound: 1.110, Upper Bound: 2.954
 - Number of potential outliers: 11

Feature: Compactness
 - Q1: 0.663, Q3: 0.744, IQR: 0.080
 - Lower Bound: 0.543, Upper Bound: 0.864
 - Number of potential outliers: 2
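If the flagged points are measurement noise rather than genuinely extreme seeds, one option is to cap (winsorize) values at the IQR fences instead of dropping rows. A hedged sketch on synthetic data standing in for a contaminated feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Tight distribution plus a low-value contamination tail
s = pd.Series(np.r_[rng.normal(0.99, 0.002, 2400),
                    rng.uniform(0.92, 0.97, 100)])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

capped = s.clip(lower, upper)  # values outside the fences are pinned to them
n_before = int(((s < lower) | (s > upper)).sum())
n_after = int(((capped < lower) | (capped > upper)).sum())
print(f"outside fences before: {n_before}, after: {n_after}")
```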

📈 Visuals

[12 figures: boxplot for each numerical feature]


🔧 Subtask 7: Feature Relationships Visualization

📋 Implementation Plan

Visualize relationships between strongly correlated feature pairs (|correlation| > 0.7) using scatter plots or pair plots. In particular, create a pair plot for 'Area', 'Perimeter', 'Major_Axis_Length', 'Convex_Area', and 'Equiv_Diameter', with 'Class' as the hue to show class-wise distributions. Display or save the plots to support understanding of the data structure and potential feature engineering opportunities.

👨‍💻 Developer Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()

# Identify strongly correlated features with absolute correlation > 0.7 (excluding self-correlation)
strong_corr_pairs = []
for col1 in numerical_cols:
    for col2 in numerical_cols:
        if col1 != col2:
            corr_val = corr_matrix.loc[col1, col2]
            if abs(corr_val) > 0.7:
                pair = tuple(sorted([col1, col2]))
                if pair not in strong_corr_pairs:
                    strong_corr_pairs.append(pair)

# Extract unique features involved in strong correlations
strong_corr_features = sorted(set([feat for pair in strong_corr_pairs for feat in pair]))

# Ensure the specified features are included (as per instruction)
specified_features = ['Area', 'Perimeter', 'Major_Axis_Length', 'Convex_Area', 'Equiv_Diameter']
for feat in specified_features:
    if feat not in strong_corr_features:
        strong_corr_features.append(feat)

# Filter features to those present in df columns (just in case)
strong_corr_features = [f for f in strong_corr_features if f in df.columns]

print("Strongly correlated features (|corr| > 0.7):")
print(strong_corr_features)

# Create pair plot for these features with 'Class' as hue
# (sns.pairplot creates its own figure, so no separate plt.figure() is needed)
pairplot = sns.pairplot(df, vars=strong_corr_features, hue='Class', palette='Set2', diag_kind='kde', plot_kws={'alpha': 0.6, 's': 30})
pairplot.fig.suptitle('Pair Plot of Strongly Correlated Features with Class Hue', y=1.02)

# Save the pairplot figure
pairplot.savefig("../output/tables/pairplot_strong_corr_features.png")

print("Pair plot saved to '../output/tables/pairplot_strong_corr_features.png'")

🖥 Execution Result

Strongly correlated features (|corr| > 0.7):
['Area', 'Aspect_Ration', 'Compactness', 'Convex_Area', 'Eccentricity', 'Equiv_Diameter', 'Major_Axis_Length', 'Perimeter', 'Roundness']
Pair plot saved to '../output/tables/pairplot_strong_corr_features.png'
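One feature-engineering follow-up suggested by the pair plot: when two features are near-duplicates (Area and Convex_Area here, r ≈ 0.9998), their ratio can replace one of them; for this dataset that ratio is essentially Solidity. A hedged sketch with synthetic stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
area = rng.normal(80000, 13000, 500)
demo = pd.DataFrame({
    "Area": area,
    # Convex hull area is at least the region area
    "Convex_Area": area + np.abs(rng.normal(800, 200, 500)),
})

# Ratio feature: bounded in (0, 1] and captures the pair's only new information
demo["Area_to_Convex"] = demo["Area"] / demo["Convex_Area"]
print(demo["Area_to_Convex"].describe())
```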

📈 Visuals

[Figure: pair plot of the strongly correlated features, colored by Class]


🔧 Subtask 8: Data Quality Checks

📋 Implementation Plan

Perform data quality checks such as verifying unique values for categorical features, checking for duplicate rows, and ensuring consistency in feature scales and units. This step ensures the dataset is clean and reliable for subsequent modeling tasks.

👨‍💻 Developer Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Check unique values for categorical features
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Unique values per categorical feature:")
for col in categorical_cols:
    unique_vals = df[col].unique()
    num_unique = df[col].nunique()
    print(f" - {col}: {num_unique} unique values -> {unique_vals}")

# 2. Check for duplicate rows
num_duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows in dataset: {num_duplicates}")

# 3. Check consistency in feature scales and units
#    - Summary statistics (already done previously, but we print min/max again for clarity)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("\nNumerical features scale check (min and max values):")
for col in numerical_cols:
    min_val = df[col].min()
    max_val = df[col].max()
    print(f" - {col}: min = {min_val}, max = {max_val}")

# 4. Visual check for scale consistency using boxplots for numerical features
for col in numerical_cols:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col], color='lightblue')
    plt.title(f'Boxplot for {col} to check scale and outliers')
    plt.xlabel(col)
    plt.grid(True, linestyle='--', alpha=0.6)

# 5. Check for inconsistent units or suspicious values by comparing related features
# For example, check if Area and Convex_Area are consistent (Convex_Area should be >= Area)
inconsistent_area = df[df['Convex_Area'] < df['Area']]
print(f"\nNumber of rows where Convex_Area < Area (possible inconsistency): {inconsistent_area.shape[0]}")

# Check if Aspect_Ration (Aspect_Ratio) values are reasonable (positive and within expected range)
if 'Aspect_Ration' in df.columns:
    invalid_aspect_ratio = df[(df['Aspect_Ration'] <= 0) | (df['Aspect_Ration'] > 10)]
    print(f"Number of rows with invalid Aspect_Ration values (<=0 or >10): {invalid_aspect_ratio.shape[0]}")

# Check for any missing values in the dataset
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values[missing_values > 0] if missing_values.any() else "No missing values detected.")

🖥 Execution Result

Unique values per categorical feature:
 - Class: 2 unique values -> ['Çerçevelik' 'Ürgüp Sivrisi']

Number of duplicate rows in dataset: 0

Numerical features scale check (min and max values):
 - Area: min = 47939, max = 136574
 - Perimeter: min = 868.485, max = 1559.45
 - Major_Axis_Length: min = 320.8446, max = 661.9113
 - Minor_Axis_Length: min = 152.1718, max = 305.818
 - Convex_Area: min = 48366, max = 138384
 - Equiv_Diameter: min = 247.0584, max = 417.0029
 - Eccentricity: min = 0.4921, max = 0.9481
 - Solidity: min = 0.9186, max = 0.9944
 - Extent: min = 0.468, max = 0.8296
 - Roundness: min = 0.5546, max = 0.9396
 - Aspect_Ration: min = 1.1487, max = 3.1444
 - Compactness: min = 0.5608, max = 0.9049

Number of rows where Convex_Area < Area (possible inconsistency): 0
Number of rows with invalid Aspect_Ration values (<=0 or >10): 0

Missing values per column:
No missing values detected.
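The scale check shows feature ranges differing by about five orders of magnitude (Area vs. Solidity), so any distance- or gradient-based model downstream would benefit from standardization. A hedged sketch with synthetic columns on those two scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = np.column_stack([
    rng.normal(80658, 13665, 1000),    # Area-like scale
    rng.normal(0.9895, 0.0035, 1000),  # Solidity-like scale
])

# Zero-mean, unit-variance per column
X_scaled = StandardScaler().fit_transform(X)
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```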

📈 Visuals

[12 figures: boxplot for each numerical feature (scale check)]